Diagnosis of Lumbar Spondylolisthesis Using Optimized Pretrained CNN Models

Spondylolisthesis refers to the slippage of one vertebral body over the adjacent one. It is a chronic condition that requires early detection to avoid surgery. This paper presents an optimized deep learning model for detecting spondylolisthesis in X-ray radiographs. The dataset contains a total of 299 X-ray radiographs, of which 156 show a spine with spondylolisthesis and 143 show a normal spine. Image augmentation is used to increase the number of data samples. In this study, the VGG16 and InceptionV3 models were used for the image classification task. The developed models are optimized using the TFLite model optimization technique. The experimental results show that the VGG16 model achieved a 98% accuracy rate, which is higher than InceptionV3's 96% accuracy rate. The size of the implemented models is reduced by up to four times so that they can be used on small devices. The compressed VGG16 and InceptionV3 models achieved 100% and 96% accuracy rates, respectively. Our findings show that the implemented models outperformed the model suggested by Varcin et al. (which had a maximum accuracy rate of 93%) in the diagnosis of lumbar spondylolisthesis. The developed quantized model also achieved a higher accuracy rate than Zebin and Rezvy's (VGG16 + TFLite) model, which reached 90% accuracy. Furthermore, by evaluating the model's performance on another publicly available dataset, we have generalised our approach on the public platform.


Introduction
Spondylolisthesis, the most prevalent condition of the immature spine, is characterised by the anterior displacement of a lumbar vertebra relative to the adjacent vertebra. Spondylolisthesis affects around 4% to 6% of the population [1][2][3]. Early detection of spondylolisthesis using radiographs may prevent the need for surgery. The availability of large amounts of multimodal data in the healthcare domain has prompted researchers to create and deploy Artificial Intelligence (AI) algorithms in this sector [4]. The importance of AI methods in the healthcare sector has increased dramatically in recent decades [5,6].
Approaches for categorizing and detecting vertebral column diseases typically include image processing techniques. Image classification has long been a research hotspot, and Deep Learning (DL) methods provide a wide range of capabilities and flexibility that can be used in image classification [7]. The Convolutional Neural Network (CNN) is the most popular type of Deep Neural Network (DNN) and uses multilayer pixel-based Artificial Neural Network (ANN) methods [8]. A CNN contains one input layer, many hidden layers, and a single output layer. It is widely used to classify images, outperforms feature-based approaches in image classification, and gives promising results in medical imaging [9,10].
To build a strong CNN model, a large amount of labelled training data as well as excellent image quality is required [11,12]. ImageNet is a growing image database with 14 million pictures and 21841 synsets catalogued [13]. The ImageNet dataset has been used to construct a number of state-of-the-art CNN networks, including VGG16 [11][12][13][14] and InceptionV3 [15].
In terms of the development of extremely popular pretrained models for image classification, 2014 was a turning point. Two of the best models for image classification using Keras are VGG16 and InceptionV3 [16]. In that year's ILSVRC, VGG16 came in second place, while Google took first place with its model GoogLeNet (known as Inception now) [17]. Due to the popularity of these models, we have selected them for disease classification task in this study.
A private dataset comprising 299 spine X-ray images is used in this research. As there were few data to work with, data augmentation [18,19] was used to increase the sample size and transfer learning [20][21][22] was used to compensate for the limited data. These are significant methods to overcome the need for a large dataset in applications where data is limited, such as medical imaging.
To diagnose lumbar spondylolisthesis, two distinct CNN architectures, VGG16 and InceptionV3, were utilized in this study. TFLite is used to create a quantized model that requires less storage and offers a quick and accurate diagnosis of lumbar spondylolisthesis. The major goals of the paper are as follows: (1) implement a flexible, quick, and quantized pretrained model for use on small devices and compare the accuracy of the implemented algorithm with previous studies, and (2) generalize the model on a public platform. This paper is organized as follows: (1) introduction about the need for image classification, (2) literature review, (3) overviews of selected pre-trained CNN algorithms for the diagnosis of spondylolisthesis, (4) materials and methods, (5) experimental setups, (6) result analysis and discussion, and (7) conclusion.

Literature Review
Many researchers have proposed solutions for healthcare domain applications by utilizing DL models. Table 1 summarises all of the literature research discussed in this section.
Varcin et al. [23] used two well-known artificial neural networks, AlexNet and GoogLeNet, to solve the challenge of spondylolisthesis diagnosis. The authors did not use a model optimization technique.
Cococi et al. [24] presented TensorFlow Lite for constructing intelligent medical devices by implementing MobileNetV3, ShuffleNetV2, and SlimNet models with Android to achieve a reasonable balance between accuracy and portability.
Cococi et al. [25] built an efficient recognition convolutional deep learning architecture integrated using Android and Raspberry Pi to run on portable, energy-efficient, resource-constrained platforms in the creation of intelligent medical equipment.
Basantwani et al. [26] have developed an Android application that employs a machine learning model to detect COVID-19 in chest X-ray or CT scans. The final model was converted into a TFLite model, which could be used in building the Android application.
Verma et al. [27] have created an innovative Android application that uses a very efficient and accurate DL algorithm to identify COVID-19 infection from chest CT images. The model is exported as a TensorFlow Lite flat buffer file (.tflite), which decreases the model's size, and the model is optimized for speed and latency on edge devices.
Bushra et al. [28] developed a CNN model and then converted it to TensorFlow Lite (TFLite) model to deploy on Android mobile.
Zebin and Rezvy [29] used multiple pre-trained convolutional backbones as the feature extractor to discriminate COVID-19 and Pneumonia-related inflammation in the lungs from normal inflammation.
Sharma et al. [30] have developed a model with the goal of detecting the existence of three pathologies, namely, Diabetic Macular Edema (DME), Choroidal Neovascularization (CNV), and Drusen and classified them using OCT (Optical Coherence Tomography).
We reviewed the literature on models used for medical disease diagnosis from X-ray images because there is only one study in our specific field. The review shows that many researchers have used TFLite as a model optimization technique for diagnosing different diseases from X-ray image datasets and have achieved good accuracy (ranging between 90% and 99.38%).

Pre-Trained VGG16 Model.
VGG16 is a six-stage pretrained model. Two convolution layers along with a max-pooling layer of stride 2 are used in each of the first two stages. Three convolution layers with a max-pooling layer of stride 2 are used in each of the next three stages. Three fully connected layers make up the final stage. The convolution layers use 3 × 3 filters with a stride of 1. Except for stage 5, each stage doubles the number of filters, starting at 64 [31][32][33][34][35][36]. Figure 1 shows the architecture of the VGG16 model, which accepts a spine X-ray image of dimension 224 × 224 × 3; after feature extraction, the model is fine-tuned for binary classification of the spondylolisthesis dataset.
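The stage layout described above can be traced with a little arithmetic. The following is a minimal sketch, not the authors' code: it assumes the convolutions preserve height and width ("same" padding) and that each max-pooling layer uses a 2 × 2 window, neither of which is stated in the text. The names `STAGES` and `vgg16_feature_shapes` are illustrative.

```python
# Convolution stages as described in the text: (number of 3x3 conv layers, filters).
# Filters double from 64 at each stage except the fifth, which stays at 512.
STAGES = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
FULLY_CONNECTED_LAYERS = 3  # the sixth and final stage


def vgg16_feature_shapes(height=224, width=224):
    """Return the (height, width, channels) of the feature map after each stage."""
    shapes = []
    for _, filters in STAGES:
        # Assumption: stride-1 convolutions with 'same' padding keep height and
        # width unchanged; the stride-2 max-pooling layer halves both.
        height, width = height // 2, width // 2
        shapes.append((height, width, filters))
    return shapes


shapes = vgg16_feature_shapes()
for stage, shape in enumerate(shapes, start=1):
    print(f"stage {stage}: {shape}")

# Weight layers: 13 convolution layers plus 3 fully connected layers.
weight_layers = sum(n for n, _ in STAGES) + FULLY_CONNECTED_LAYERS
print("weight layers:", weight_layers)
```

Under these assumptions a 224 × 224 × 3 input shrinks to a 7 × 7 × 512 feature map before the fully connected stage, and the 13 convolution layers plus 3 fully connected layers account for the "16" in the model's name.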

Pre-trained InceptionV3 Model.
The InceptionV3 network is made up of various modules that enable more efficient computation and deeper networks by using stacked 1 × 1 convolutions to reduce dimensionality. Some operations, such as 1 × 1, 3 × 3, and 5 × 5 convolutions and max pooling, are done in parallel and their outputs are concatenated. "Inception layer" is the name given to this concatenation [15], [22], [37], [38].
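To see why a 1 × 1 convolution placed before a larger filter makes the computation cheaper, it helps to count weights. The sketch below is purely illustrative: the channel counts (192, 16, 32) are hypothetical values chosen for the example and are not taken from this paper or from the InceptionV3 configuration used here.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Hypothetical channel counts, for illustration only.
IN_CHANNELS = 192   # depth of the incoming feature map
REDUCED = 16        # depth after the 1x1 dimensionality reduction
OUT_CHANNELS = 32   # number of 5x5 filters

# 5x5 convolution applied directly to the full-depth input.
direct = conv_weights(5, IN_CHANNELS, OUT_CHANNELS)

# 1x1 convolution to reduce depth, followed by the same 5x5 convolution.
bottleneck = conv_weights(1, IN_CHANNELS, REDUCED) + conv_weights(5, REDUCED, OUT_CHANNELS)

print("direct 5x5:", direct)
print("1x1 then 5x5:", bottleneck)
print("reduction factor:", round(direct / bottleneck, 1))
```

With these example numbers the bottleneck path needs roughly a tenth of the weights of the direct 5 × 5 convolution, which is the saving that lets Inception-style networks grow deeper at a manageable cost.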
The InceptionV3 model accepts a spine X-ray image of dimension 299 × 299 × 3 as input; after feature extraction and fine-tuning, it outputs the binary classification of the spine X-ray image using the SoftMax function. Figure 2 explains the architecture of the Inception layer used for the spondylolisthesis dataset.
Some characteristics of the cutting-edge pre-trained CNN networks, VGG16 and InceptionV3, are shown in Table 2 [39,40]. The selected models were preloaded with ImageNet weights and then fine-tuned for the binary classification task. Categorical Cross-Entropy (CE) loss is used to train both models.
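The setup described above (ImageNet weights, a softmax output over two classes, categorical cross-entropy loss) might be expressed in Keras roughly as follows. This is a sketch under assumptions, not the authors' implementation: the classification head (global average pooling followed by one dense layer), the frozen backbone, and the Adam optimizer are choices made for the example, and `build_classifier` and `INPUT_SHAPES` are hypothetical names.

```python
# Input sizes stated in the text for each backbone.
INPUT_SHAPES = {"VGG16": (224, 224, 3), "InceptionV3": (299, 299, 3)}


def build_classifier(name, num_classes=2, weights="imagenet"):
    """Build a pre-trained backbone with a new softmax head (illustrative only)."""
    import tensorflow as tf  # imported lazily so this module loads without TensorFlow

    backbones = {
        "VGG16": tf.keras.applications.VGG16,
        "InceptionV3": tf.keras.applications.InceptionV3,
    }
    base = backbones[name](
        weights=weights,            # "imagenet" preloads the ImageNet weights
        include_top=False,          # drop the original 1000-class ImageNet head
        input_shape=INPUT_SHAPES[name],
    )
    base.trainable = False          # assumption: only the new head is trained

    # Assumed head: pooled features followed by a two-way softmax layer.
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer="adam",                    # assumption: optimizer is not stated
        loss="categorical_crossentropy",     # loss named in the text
        metrics=["accuracy"],
    )
    return model
```

Calling `build_classifier("VGG16")` downloads the ImageNet weights on first use; the labels would need to be one-hot encoded to match the categorical cross-entropy loss.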

Materials and Methods
By optimizing and compressing the size of a pre-trained transfer learning model with TFLite, we were able to develop a quantized model which can be used on small devices. Figure 3 illustrates the block diagram for the planned process.
In the first stage (data acquisition), radiographic images of the spine are collected and stored. In the second stage, data augmentation is employed to increase the number of data samples, and pre-trained models are used to extract decision features.
In the next stage, after training on part of the dataset, the proposed models are assessed for efficiency. Then, using TFLite, the tested model is reduced by up to four times to provide a quantized model.

Radiographic Image Acquisition.
A real-time dataset was collected from our own private collection of X-rays for this investigation. Physiotherapy and rehabilitation professionals categorized the collected radiographic images of the vertebrae as healthy or spondylolisthesis images. Some of the radiographs were removed because they were not technically sound. Some of the images from the dataset along with their classifications are shown in Figure 4. The dataset contains a total of 299 spine X-ray images of various sizes. It includes 156 images of people with spondylolisthesis and 143 images of healthy people (without spondylolisthesis). The radiographs were resized to 224 × 224 × 3 dimensions in order to create images of vertebral columns mainly focusing on the L4-L5 and L5-S1 vertebrae [41]. Table 3 lists the features of the final dataset.
For effective prediction, all DL models require large sets of data. We utilized data augmentation to generate an adequate number of images from our dataset for proper diagnosis of disease.

Data Augmentation.
In order to obtain an adequate dataset, data augmentation is used to generate more data by means of image processing operations. In this approach, the original images from which the additional training data are generated are already labelled, so each generated image inherits the label of its source [18,19].
While improving overall performance, image data augmentation helps prevent a CNN from learning irrelevant patterns, overfitting, and memorizing the specific properties of the training images. Cropping, translating, and reflecting the image are only a few of the data augmentation procedures. In this study, 701 additional images were created as a result of the data augmentation.
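The three operations named above can be illustrated on a tiny grid of pixel values. This is a minimal, dependency-free sketch: the paper does not state which augmentation library or parameter values were used, so the functions `reflect`, `translate`, and `crop` and their arguments are illustrative only.

```python
def reflect(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def translate(image, shift, fill=0):
    """Shift the image `shift` pixels to the right, padding the left edge with `fill`."""
    width = len(image[0])
    return [([fill] * shift + row)[:width] for row in image]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# A 3 x 3 toy "image" standing in for a radiograph.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

reflected = reflect(image)
shifted = translate(image, shift=1)
cropped = crop(image, top=0, left=1, height=2, width=2)

# Dataset size after augmentation, using the counts reported in the text.
ORIGINAL_IMAGES = 299
AUGMENTED_IMAGES = 701
total_images = ORIGINAL_IMAGES + AUGMENTED_IMAGES
print("total images:", total_images)
```

Each transformed image keeps the label of the radiograph it was derived from, which is how the 299 labelled originals and 701 generated images add up to the 1000 samples used later.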

Transfer Learning.
A CNN model suffers when the dataset lacks diversity and quantity. The goal of transfer learning is to impart knowledge to a data-limited domain by reusing a model trained on a large amount of training data [12], [18].

Quantization Using TFLite.
Native TensorFlow Lite quantization can be used to optimize a model [42]. It transforms the whole model into a flat buffer. A computer uses a 32-bit floating-point representation of a real number for most purposes; quantization transforms these 32-bit floating-point values into 8-bit integers with minor or no accuracy loss. This results in a large reduction in the model's size [43]. Figure 5 shows the entire process of model compression. The initial step is to log the data before compression. The trained model is then converted into a TensorFlow Lite model. In the second phase, we measure the performance of the quantized (TFLite) model. The model is loaded into the interpreter to test it on a single image. Finally, the model is evaluated on the whole dataset, and the accuracies of the base and quantized models are compared to check the difference.
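The idea of mapping 32-bit floats to 8-bit integers can be sketched with a generic affine (scale and zero-point) scheme. This is an illustration of the principle only and is not TFLite's exact internal procedure; the function names and the sample weights are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers using an affine scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The represented range must include 0.0 so that zero maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # fall back to 1.0 for an all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return [scale * (q - zero_point) for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]   # hypothetical float32 weights
quantized, scale, zero_point = quantize(weights)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Storage per value: 4 bytes for float32 versus 1 byte for an 8-bit integer.
FLOAT32_BYTES, INT8_BYTES = 4, 1
size_ratio = FLOAT32_BYTES / INT8_BYTES

print("quantized:", quantized)
print("max reconstruction error:", max_error)
print("size ratio:", size_ratio)
```

The reconstruction error stays within one quantization step (`scale`), while each stored value shrinks from 4 bytes to 1 byte, which is where the "up to four times" size reduction comes from.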

Experimental Design.
This experiment was built in Python 3 in a Windows environment using the Google Colab platform. The DL framework is TensorFlow, version 2.5.0. Accuracy/loss curves are displayed using the pyplot module from the Matplotlib package, which offers a MATLAB-like interface to the underlying object-oriented plotting library. To produce the desired plot, it creates figures and axes implicitly and automatically.

Data Splitting.
The dataset, with images of 224 × 224 × 3 dimensions, is split in the standard 70 : 30 ratio and separated into three groups, a training set (700 images), a validation set (250 images), and a test set (50 images), using the train_test_split() method, with test step 50 and test batch 1. Table 4 illustrates the statistics of the split dataset.
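The split sizes above follow from taking 70% of the 1000 samples for training and dividing the remaining 30% into 250 validation and 50 test images. The sketch below reproduces that arithmetic with the standard library only; it stands in for the train_test_split() call mentioned in the text, and the function name, seed, and use of integer IDs in place of images are assumptions made for the example.

```python
import random


def split_dataset(items, train_frac=0.70, val_count=250, seed=0):
    """Shuffle the items, then cut them into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)           # fixed seed for a repeatable split
    n_train = round(len(items) * train_frac)     # 70% of the samples for training
    train = items[:n_train]
    val = items[n_train:n_train + val_count]     # next 250 samples for validation
    test = items[n_train + val_count:]           # whatever remains is the test set
    return train, val, test


# Integer IDs stand in for the 1000 augmented radiographs.
train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

A fixed seed keeps the partition reproducible between runs. A stratified split would additionally keep the ratio of spondylolisthesis to normal images the same in every subset; the text does not say whether stratification was used.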

Model Training.
The first step is to build on a model that was created from a large amount of data. The training set is used to train the model, and the test set is used to test the model on the image classification task. The test data is applied to determine the performance of the specified algorithms using the above-mentioned training parameters.

Experimental Results.
In this study, two pre-trained transfer learning models from classical and modern architectures were utilized (as explained in the sections on the pre-trained VGG16 and InceptionV3 models). The performance of each model is evaluated in terms of accuracy/loss graphs and the confusion matrix.

Training Accuracy/Loss.
The accuracy/loss subplot shows the continuous learning of a model. The selected models were trained for 5 epochs in this experiment. VGG16 Training Accuracy/Loss. According to Figure 6, VGG16 achieved a maximum accuracy of 98% with a training loss of 0.08; that is, the model learned effectively and properly distinguishes between spondylolisthesis and normal cases.
InceptionV3 Training Accuracy/Loss. Figure 7 indicates that InceptionV3 achieved 96% accuracy with a training loss of 0.08. This shows that the model learned effectively but is less accurate than VGG16 at distinguishing between spondylolisthesis and normal cases.

Confusion Matrix.
Confusion matrices of the VGG16 and InceptionV3 models are displayed in Figure 8(a) and Figure 9(a), respectively. A total of 50 X-ray radiographs were used in the test set (28 spondylolisthesis and 22 normal). In the confusion matrix, actual cases are arranged in rows, whereas predicted cases are arranged in columns. Class 0 and class 1 indicate normal and spondylolisthesis cases, respectively.

VGG16's Confusion Matrix and Classification Report.
In the case of VGG16 (Figure 8(a)), out of 22 normal patients, the model correctly identified 21 and misclassified 1 case as spondylolisthesis. The model predicted the correct class label for all 28 spondylolisthesis cases, which is consistent with the reported 98% accuracy (49 of 50 test images).

InceptionV3's Confusion Matrix and Classification Report.
In the case of InceptionV3 (Figure 9(a)), the model correctly identified 26 of 28 spondylolisthesis patients and misclassified 2 cases as normal. The model correctly identified all the normal cases.

Assessment of Performance Using Metrics.
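Accuracy, precision, recall, and F1-score were utilized to evaluate the performance of the selected models in this work. These metrics are calculated using the following formulae, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Accuracy: the number of correct predictions divided by the total number of predictions.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \quad (1)$$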

Computational Intelligence and Neuroscience
Precision: the proportion of positive predictions that are correct.
$$\text{Precision} = \frac{TP}{TP + FP}. \quad (2)$$
Recall: how well the detector finds all ground-truth positives.
$$\text{Recall} = \frac{TP}{TP + FN}. \quad (3)$$
F1-score: when a quick way to compare two classifiers is needed, it is often easier to combine precision and recall into a single statistic called the F1-score. The F1-score is the harmonic mean of precision and recall.
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \quad (4)$$
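As a worked example, the four metrics can be computed from the confusion-matrix counts reported for the 50-image test set, treating spondylolisthesis (class 1) as the positive class. The InceptionV3 counts are taken directly from the text. For VGG16 the text gives 21 correct and 1 misclassified normal case; the count of 28 correctly classified spondylolisthesis cases is an assumption inferred from the reported 98% accuracy. The function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# VGG16: 1 normal case predicted as spondylolisthesis (FP = 1);
# TP = 28 and FN = 0 are assumed from the reported 98% accuracy.
vgg16 = classification_metrics(tp=28, fp=1, fn=0, tn=21)

# InceptionV3: 2 spondylolisthesis cases predicted as normal (FN = 2),
# all 22 normal cases correct (FP = 0), as stated in the text.
inception_v3 = classification_metrics(tp=26, fp=0, fn=2, tn=22)

for name, metrics in [("VGG16", vgg16), ("InceptionV3", inception_v3)]:
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Under these counts the accuracies come out as 49/50 and 48/50, matching the 98% and 96% figures reported for the two models.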

Model Compression Using TFLite.
The VGG16 and InceptionV3 models were compressed by up to four times for use on small devices. A helper function is used to evaluate the performance of the compressed model on the test dataset. Table 6 illustrates the comparison between the original model and the compressed model.
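The paper does not list its conversion code or its evaluation helper, so the following is only a sketch of how such a workflow might look with the TensorFlow Lite converter and interpreter. The function names are hypothetical, and the use of default optimizations (dynamic-range quantization) is an assumption; the authors' exact converter settings are not given.

```python
def convert_to_tflite(keras_model):
    """Convert a trained Keras model to a quantized TFLite flat buffer (bytes)."""
    import tensorflow as tf  # imported lazily so this module loads without TensorFlow

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    # Assumption: default optimizations, which quantize the weights to 8 bits.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()


def evaluate_tflite(tflite_bytes, images, labels):
    """Hypothetical helper: accuracy of a TFLite model, one image at a time."""
    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
    interpreter.allocate_tensors()
    input_info = interpreter.get_input_details()[0]
    output_info = interpreter.get_output_details()[0]

    correct = 0
    for image, label in zip(images, labels):
        # Add a batch dimension and match the dtype the model expects.
        batch = np.expand_dims(image, axis=0).astype(input_info["dtype"])
        interpreter.set_tensor(input_info["index"], batch)
        interpreter.invoke()
        probabilities = interpreter.get_tensor(output_info["index"])[0]
        correct += int(np.argmax(probabilities) == label)   # label is a class index
    return correct / len(labels)


def compression_ratio(base_size_bytes, quantized_size_bytes):
    """How many times smaller the quantized model is than the base model."""
    return base_size_bytes / quantized_size_bytes
```

In use, `len(convert_to_tflite(model))` gives the size of the quantized flat buffer in bytes, which can be passed to `compression_ratio` together with the size of the saved base model; the accuracy returned by `evaluate_tflite` can then be compared with the base model's accuracy on the same test images.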

Performance Comparison Using Publicly Available Dataset.
Kaggle's Pneumonia dataset is used to validate the findings of our best-performing model. Some samples from the selected dataset are displayed in Figure 10.
Out of 5232 images, 3883 images are of Pneumonia patients and 1349 images are of normal patients. Using a conventional 70 : 30 ratio, the dataset is separated into training and test sets, and the test set is further divided into test and validation sets. The statistics of the split dataset are described in Table 7.

Training Accuracy/Loss.
The training accuracy/loss graph for the best-performing model (VGG16) is shown in Figure 11.

Confusion Matrix.
The confusion matrix for the selected Pneumonia dataset is shown in Figure 12. It shows that the VGG16 model correctly categorised the cases as normal and Pneumonia patients.

Compressing the Model Using TFLite.
The fine-tuned VGG16 model is compressed by a factor of four to form a quantized model. Table 8 shows the size and accuracy of the base and quantized models.
According to Table 8, the quantized model achieved 100% accuracy, which validates our previous finding. The implemented quantized model performed similarly on both the private (spondylolisthesis) and public (Pneumonia) datasets.

Result Analysis and Discussion
In this study, two pre-trained transfer learning CNN models, VGG16 and InceptionV3, were selected for the disease classification task. Spine X-ray radiographs of normal and spondylolisthesis patients were used to train, validate, and test these models. The data augmentation technique is used to create enough images (a total of 1000 samples). The goal was to accurately diagnose the disease and assess the performance of the selected models. According to the above figures and tables, VGG16 and InceptionV3 achieved 98% and 96% accuracy rates, respectively. In a prior study, Varcin et al. [23] employed two distinct networks, AlexNet and GoogLeNet, for spondylolisthesis diagnosis on their private dataset. According to that research, GoogLeNet is somewhat more successful than AlexNet, attaining a 93% accuracy rate. Our results outperform the prior work by attaining a peak accuracy of 98%.
Both models were compressed by up to four times using the TFLite converter. Our findings show that there is a minor difference (a 2% increase in the case of VGG16) or no difference (in the case of InceptionV3) between the accuracies of the original and quantized models. Compared with the literature survey (Table 1), the suggested VGG16 + TFLite model attained 100% accuracy, which is higher than Zebin and Rezvy's [29] 90% accuracy for the same model. The model has also been applied to a publicly available dataset. The Pneumonia dataset has a larger amount of data than our spondylolisthesis dataset, and the accuracy attained by the VGG16 model on the Pneumonia dataset is higher than that on the spondylolisthesis dataset. On the basis of our findings, we can conclude that the implemented quantized model is reliable and efficient for disease classification in general.

Conclusion
In this study, the performances of two deep neural networks, VGG16 and InceptionV3, were compared for spondylolisthesis diagnosis. Data augmentation is used to increase the sample size. The VGG16 model achieved a 98% accuracy rate, which is higher than InceptionV3's 96% accuracy rate. We also applied quantization to reduce the model size by up to four times. The implemented models outperformed prior studies. Moreover, we have generalized the model on the public platform. Although these models may be used as a substitute for manual radiological analysis and can help clinicians diagnose spondylolisthesis from spine X-ray data automatically, further study is needed for grading spondylolisthesis through X-ray images.

Conflicts of Interest
The authors state that there are no commercial or financial ties that might be interpreted as possible conflicts of interest in the research.